rm(list = ls())
We focus on nonparametric kernel regression and consider the Mean (Integrated) Squared Error as a measure of risk.
We consider the problem of estimating the following function:
f <- function(x) sqrt(x)*sin(pi*x)
curve(f, ylim = c(0, 1))
n <- 200
X <- 1:n/n
The number of observations is \(n = 200\) and we consider a fixed design: \(X_i = i/n, i=1 \dots n\). We generate a sample by adding centered Gaussian errors with standard deviation \(\sigma = 0.2\) (variance \(\sigma^2 = 0.04\)) to the true function \(f\).
sigma <- 0.2
sigma2 <- sigma^2
eps <- rnorm(n, sd = sqrt(sigma2))
Y <- f(X) + eps
plot(X, Y)
The Nadaraya-Watson (NW) estimator of \(f\) at \(x \in [0,1]\) associated with the kernel \(K\) is:
\[\hat{f}_n(x) = \frac{\sum_{i=1}^n Y_i K\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^n K\left(\frac{X_i - x}{h}\right)}, \]
where \(h>0\) is called the bandwidth and \(K: \mathbb{R} \to \mathbb{R}\) is a kernel. We focus here on the Gaussian kernel.
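To make the formula concrete, here is a direct, naive implementation of the NW estimator at a single point, using dnorm as the Gaussian kernel. This is a sketch for illustration only (the point x0 = 0.5 and the bandwidth 0.057, which anticipates the choice discussed below, are arbitrary); in practice we will use locpoly.
nw_manual <- function(x0, X, Y, h) {
  w <- dnorm((X - x0) / h)  # kernel weights K((X_i - x)/h)
  sum(w * Y) / sum(w)       # weighted average of the Y_i
}
nw_manual(0.5, X, Y, h = 0.057)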
The Mean Squared Error (MSE) at a point is defined by:
\[MSE(x) = \mathbb{E}_f \left[ \left(\hat{f_n}(x) - f(x)\right)^2 \right],\]
and the integrated risk, the Mean Integrated Squared Error (MISE), is defined by:
\[MISE = \mathbb{E}_f \left[ \int_0^1 \left(\hat{f_n}(x) - f(x)\right)^2 dx\right]\]
As in the course, we consider a discretized version of the MISE:
\[MISE^D = \mathbb{E}_f \left[\frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - f(X_i)\right)^2 \right]\]
There are two main R functions for kernel nonparametric regression:
- ksmooth implements the Nadaraya-Watson kernel estimator with a rectangular or Gaussian kernel.
- locpoly implements local polynomial estimators with a Gaussian kernel.

We will only use the function locpoly. It is part of the package KernSmooth, so we load this package:
library(KernSmooth)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
We begin by looking at the help pages of these functions:
# ?locpoly
The four most important arguments for us are x, y, bandwidth, and degree.
We will work with the following custom function:
NW <- function(x, y, bandwidth, gridsize = length(x), ...) {
locpoly(x, y, degree = 0, bandwidth = bandwidth, gridsize = gridsize, ...)
}
We calculate the Nadaraya-Watson estimator and plot it along with the data points:
fit <- NW(X, Y, bandwidth = 0.3)
plot(X, Y)
curve(f, add = TRUE, col = 2, lty = 2)
lines(fit, col=3)
lgd <- c("data points", "true f", "NW estimation")
legend("bottom", lgd, col = 1:3, lty = c(NA, 2, 1), pch = c(1, NA, NA))
Let us accept for the moment that a “good choice” for the bandwidth in the NW estimator is \(h^\star = 0.057\). In order to further assess the influence of the bandwidth on the quality of estimation, we compare this choice to two other choices \(h = 1/C \times h^\star\) and \(h = C \times h^\star\), for \(C=10\).
par(lwd = 2)
plot(X, Y, col = "lightgray")
curve(f, add = TRUE, col = 1, lty = 2)
best <- 0.057
C <- 10
bwd <- c(1/C, 1, C)*best
for (kk in 1:length(bwd)) {
fit <- NW(X, Y, bandwidth = bwd[kk])
lines(fit, col = 1 + kk)
}
lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
legend("bottom", lgd, col = 2:4, lty = 1)
C_list <- c(0.1, 0.5, 1, 5)
par(mfrow = c(1,1))
for (C in C_list) {
plot(X, Y, col = "lightgray", main = paste("C = ", C))
curve(f, add = TRUE, col = 1, lty = 2)
bwd <- c(1/C, 1, C)*best
for (kk in 1:length(bwd)) {
fit <- NW(X, Y, bandwidth = bwd[kk])
lines(fit, col = 1 + kk)
}
lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
legend("topright", lgd, col = 2:4, lty = 1)
}
For the smallest bandwidth values, the estimated function fluctuates a lot (e.g. the blue curve for C = 0.1). With a small bandwidth, the estimator averages over a very narrow interval and sticks too closely to the observations.
For the largest bandwidth values, the estimated function is almost constant (e.g. the blue curve for C = 5). With a very large bandwidth, the estimator averages over intervals that are too wide and does not follow the observations enough.
In conclusion:
- With a bandwidth that is too small, the model overfits the data.
- With a bandwidth that is too large, it underfits the data.
The objective is to find the optimal bandwidth.
The goal of this section is to estimate the optimal bandwidth for the NW estimator, and justify the choice \(h^\star = 0.057\) in the preceding section. We measure the quality of estimation by the discretized MISE:
\[MISE^D = \mathbb{E}_f \left[ \frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - f(X_i)\right)^2 \right]\]
If \(f\) were known, then we could estimate the discretized MISE by the following quantity:
\[\frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - f(X_i)\right)^2\]
To find the optimal bandwidth, we simply minimize this criterion, implemented below as MISE_NW, as a function of \(h\).
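The function MISE_NW is not defined explicitly in this document; a minimal version consistent with its usage below (and with the analogous MISE_LP defined later) would be:
MISE_NW <- function(bandwidth, x, y, f) {
  fit <- NW(x, y, bandwidth = bandwidth)
  mean((fit$y - f(fit$x))^2)  # plug-in estimate of the discretized MISE
}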
opt <- optimize(MISE_NW, X, Y, f, interval = c(0, 1))
best <- round(opt$minimum, 3)
best
## [1] 0.049
The above calculation cannot be done in practice, as it requires knowledge of the function f. In our case, f is precisely the function we are trying to estimate, so we cannot use it inside the estimator.
First, let’s generate new X and Y with 2000 observations
X_2000 <- 1:2000/2000
eps_2000 <- rnorm(length(X_2000), sd = sqrt(sigma2))
Y_2000 <- f(X_2000) + eps_2000
plot(X_2000, Y_2000)
And calculate the new optimal bandwidth.
opt_2000 <- optimize(MISE_NW, X_2000, Y_2000, f, interval = c(0, 1))
best_2000 <- round(opt_2000$minimum, 3)
paste ("Optimal bandwith with 2000 observations = ", best_2000)
## [1] "Optimal bandwith with 2000 observations = 0.023"
Then plot the estimated function f as well as the real f.
plot(X_2000, Y_2000, col = "lightgray", main = paste("f vs estimated f with optimal bandwidth = ", best_2000))
curve(f, add = TRUE, col = 1, lty = 2)
bwd <- best_2000
fit <- NW(X_2000, Y_2000, bandwidth = bwd)
lines(fit, col ="red")
The estimated curve is very close to the true function f: the estimator does a good job here.
We know that the Gaussian kernel has order equal to 1. Based on the two measurements of the optimal bandwidth, we can estimate the coefficient \(\beta\) and check that its integer part is equal to 1. We know that
\[h_n^* = C_{te} \times n^{\frac{-1}{2\beta+1}}.\]
Then, since we have measured \(h^*\) for \(n=200\) and \(n=2000\), we can estimate \(\beta\):
\[\beta = \frac{1}{2} \left(1 - \frac{\ln(n_1/n_2)}{\ln(h_{n_1}^*/h_{n_2}^*)}\right).\]
We first estimate a mean optimal bandwidth over several samples, so that the estimate does not rely too much on the noise. Then we compute the \(\beta\) coefficient.
measure_mean_opt_bwd <- function (n, nRep) {
best_bwd_list <- c()
for (i in 1:nRep) {
X <- 1:n/n
eps <- rnorm(n, sd = sqrt(sigma2))
Y <- f(X) + eps
opt <- optimize(MISE_NW, X, Y, f, interval = c(0, 1))
best <- round(opt$minimum, 3)
best_bwd_list[length(best_bwd_list) + 1] <- best
}
return (mean(best_bwd_list))
}
n_1 <- 200
n_2 <- 2000
h_n1 <- measure_mean_opt_bwd(n_1, 10)
h_n2 <- measure_mean_opt_bwd(n_2, 10)
beta <- 0.5 * (1 - log(n_1 / n_2) / log(h_n1 / h_n2) )
print(paste ("Estimated beta coefficient = ", round(beta,3)))
## [1] "Estimated beta coefficient = 2.587"
Even with a large number of simulations, we always get a \(\beta\) whose integer part is equal to 2. This is a surprising result, as we would have expected \(\beta = 1\).
With \(\beta\) known, we can estimate the optimal bandwidth for any sample size: \[h_{n_2}^* = h_{n_1}^* \times \left(\frac{n_2}{n_1}\right)^{\frac{-1}{2\beta+1}}\]
n_3 <- 20000
h_n3 <- h_n1 * (n_3 / n_1) ^( -1 / (2*beta +1) )
print(paste("Estimated optimal bandwidth for n = ", n_3, "=", round (h_n3, 3)))
## [1] "Estimated optimal bandwidth for n = 20000 = 0.021"
n_4 <- 200000
h_n4 <- h_n1 * (n_4 / n_1) ^( -1 / (2*beta +1) )
print(paste("Estimated optimal bandwidth for n = ", n_4, "=", round (h_n4, 3)))
## [1] "Estimated optimal bandwidth for n = 2e+05 = 0.014"
We consider the following estimator of the discretized MISE:
\[\widehat{MISE}^{D,0} = \frac{1}{n} \sum_{i=1}^n \left(\hat{f_n}(X_i) - Y_i\right)^2\]
This estimator can be implemented as follows:
MISE_NW_0 <- function(bandwidth, x, y) {
  fit <- NW(x, y, bandwidth = bandwidth)
  mean((fit$y - y)^2)  # compares the fit to the noisy Y_i (training error)
}
We define a grid of 100 values for the bandwidth h between 0 and 0.2.
hs <- 1:100/500
1. Calculate the above estimator for all bandwidths, and plot the result as a function of h.
mise_nw_0_list <- c()
for (h in hs){
mise_nw_0_list[length(mise_nw_0_list) + 1] <- MISE_NW_0(h,X,Y)
}
plot(hs, mise_nw_0_list, xlab = "bandwidth", ylab = "Discretized MISE")
2. Comment on the results and explain why this estimator cannot be used to estimate an optimal bandwidth.
This estimator cannot be used to find the h that minimizes the MISE: the lowest value of the criterion is obtained as h tends to 0, and we have already explained that this is not an option, as it leads to an overfitting model.
However, we can split the curve into three intervals:
- For \(h<0.03\), the criterion increases quickly with h. This is where the estimator overfits.
- For \(h \in [0.03, 0.06]\), the curve is rather flat.
- For \(h>0.06\), the criterion increases quickly again. This is where the estimator underfits.
So even if we cannot find the h that gives the minimal MISE this way, we can graphically guess that the optimal bandwidth lies between 0.03 and 0.06, which is consistent with the result obtained earlier (\(h^\star \approx 0.049\)).
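As a quick check of this reading of the curve, we can look at which grid value minimizes the criterion; following the discussion above, it should be (close to) the smallest bandwidth on the grid:
hs[which.min(mise_nw_0_list)]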
The estimator at Question 2 is based on the knowledge of \(f\) and a single realization \((X,Y)\). To estimate the discretized MISE when the simulation model is known, we propose in this section to replace the expectation \(\mathbb{E}_f\) in the definition of the discretized MISE by an empirical mean over simulation runs.
We start by defining a function that performs one simulation run and outputs the associated NW estimator:
fHat <- function(bandwidth, n, sigma2, f) {
eps <- rnorm(n, sd = sqrt(sigma2))
X <- 1:n/n
Y <- f(X) + eps
fit <- NW(X, Y, bandwidth = bandwidth)
fit$y
}
We display the result of three simulation runs:
curve(f, ylim = c(-0.1, 0.8))
lines(X, fHat(0.1, n, sigma2, f), col = 2, lty = 2)
lines(X, fHat(0.1, n, sigma2, f), col = 3, lty = 2)
lines(X, fHat(0.1, n, sigma2, f), col = 4, lty = 2)
Following this idea, we generate and plot a matrix of realizations of \(\hat{f}\):
bandwidth <- 0.3
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = "lightgray")
curve(f, add = TRUE)
bandwidth <- 0.04
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 3)
curve(f, add = TRUE)
These are 101 estimates of f with bandwidth 0.04. This value is close to the optimal bandwidth computed earlier, which is why the estimates are accurate.
bandwidth <- 0.004
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)
These are 101 estimates of f with bandwidth 0.004. This is a very small bandwidth and the estimated functions fluctuate a lot: as explained previously, the estimator is overfitting the data.
bandwidth <- 0.01
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)
With a bandwidth of 0.01, we are still overfitting, as expected: the bandwidth is too small.
bandwidth <- 0.5
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)
With a bandwidth of 0.5, the model is underfitting, as the bandwidth is too large.
diffs <- fHatMat - f(X) ## Note: a bit dangerous, relies on R recycling f(X) down the columns
diffs2 <- sweep(fHatMat, 1, f(X)) ## Safer: explicitly subtracts f(X) from each column
max(abs(diffs2 - diffs)) ## sanity check: both approaches agree
## [1] 0
str(rowMeans(diffs^2))
## num [1:200] 0.162 0.161 0.16 0.158 0.156 ...
We encapsulate the above in a function:
MSE <- function(bandwidth, n, sigma2, f, nRep) {
  fHatMat <- replicate(nRep, fHat(bandwidth, n, sigma2, f))
  X <- 1:n/n                        # same fixed design as in fHat
  diffs <- sweep(fHatMat, 1, f(X))  # subtract f(X_i) from each column
  rowMeans(diffs^2)                 # pointwise MSE at each design point
}
We recall that we have the following bias-variance decomposition:
\[ MISE = \int_0^1 b^2(x) dx + \int_0^1 \sigma^2(x) dx,\]
where \(b(x) = \mathbb{E}_f \left[\hat{f_n}(x) \right]- f(x)\) and \(\sigma^2(x) = \mathbb{E}_f \left[(\hat{f_n}(x) - \mathbb{E}_f (\hat{f_n}(x)))^2 \right]\).
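As a sanity check, we can verify this decomposition empirically on a matrix of replications, reusing the fHat machinery above (the identity holds exactly when all terms are computed as empirical means over the same replications):
fHatMat <- replicate(101, fHat(0.04, n, sigma2, f))
fBar <- rowMeans(fHatMat)                     # estimate of E[fHat(x)] at each X_i
b2 <- (fBar - f(X))^2                         # squared bias at each X_i
v <- rowMeans(sweep(fHatMat, 1, fBar)^2)      # variance at each X_i
mse <- rowMeans(sweep(fHatMat, 1, f(X))^2)    # pointwise MSE at each X_i
max(abs(mse - (b2 + v)))                      # numerically ~0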
What do the following quantities MSEh1, MSEh2, MSEh3 and Estim1, Estim2, Estim3 contain?
MSEh1 <- MSE(0.3, n, sigma2, f, 1001)
MSEh2 <- MSE(0.04, n, sigma2, f, 101)
MSEh3 <- MSE(0.003, n, sigma2, f, 101)
Estim1 <- sum(MSEh1)/n
Estim2 <- sum(MSEh2)/n
Estim3 <- sum(MSEh3)/n
Estim1
## [1] 0.03764985
Estim2
## [1] 0.001690027
Estim3
## [1] 0.02009015
MSEh1, MSEh2 and MSEh3 are vectors of size n containing, for each design point, the MSE estimated over the nRep replications stored in fHatMat. Estim1, Estim2 and Estim3 are scalars equal to the mean of these n pointwise MSEs, i.e. estimates of the discretized MISE.
- MSEh1 is the vector of n MSEs for bandwidth \(h_1 = 0.3\) and 1001 repetitions. This is a large bandwidth, the model underfits, and the MSE is large.
- MSEh2 is the vector of n MSEs for bandwidth \(h_2 = 0.04\) and 101 repetitions. This is close to the optimal bandwidth and the MSE is low.
- MSEh3 is the vector of n MSEs for bandwidth \(h_3 = 0.003\) and 101 repetitions. This is a small bandwidth, the model overfits, and the MSE is larger than with the optimal bandwidth.
Can this estimator of the MSE be used in practice? No: it once again uses the unknown function f to compute the errors.
Estimate \(b^2:=\int_0^1 b^2(x)dx\), \(\sigma^2:=\int_0^1 \sigma^2(x)dx\) and plot them as a function of \(h\) on the same figure along with \(MISE\). Comment on this figure in terms of bias-variance tradeoff.
calculate_bias_and_variance <- function (nRep, X_vect) {
bias_list <- c()
var_list <- c()
n <- length(X_vect)
for (h in hs){
fHatMat <- replicate(nRep, fHat(h, n, sigma2, f))
FHatMat_esp <- rowMeans(fHatMat)
bias <- FHatMat_esp - f(X_vect)
diffs <- sweep(fHatMat, 1, FHatMat_esp)
var <- rowMeans(diffs^2)
bias_list[length(bias_list) + 1] <- sum(bias^2)/n
var_list[length(var_list) + 1] <- sum(var)/n
}
return( list ("bias" = bias_list, "var" = var_list) )
}
nRep <- 101
hs <- 1:100/500
bias_and_var <- calculate_bias_and_variance(nRep, X)
bias_list <- bias_and_var$bias
var_list <- bias_and_var$var
mise_list <- bias_list + var_list
plot(x = hs, y = bias_list, col = 2, ylab = "Integrated risk", type = "l", lwd = 2)
points(x = hs, y = var_list, col = 3, type = "l", lwd = 2)
points(x = hs, y = mise_list, col = 4, type = "l", lwd = 2)
legend(x = "topleft", lty = 1, legend = c("Squared bias", "Variance", "MISE"), col = 2:4)
As expected, the squared bias term increases with h while the variance term decreases with h: the MISE minimizer balances these two terms.
h_opt_mise <- hs[which.min(mise_list)]
print(paste("The bandwidth that minimizes the MISE (squared bias + variance) is", h_opt_mise))
## [1] "The bandwidth that minimizes the MISE (squared bias + variance) is 0.05"
print(paste("REMINDER: The bandwidth that was estimated previously as optimal was", best))
## [1] "REMINDER: The bandwidth that was estimated previously as optimal was 0.049"
We now focus on the pointwise risk (MSE). We can visualize the influence of the choice of \(h\) directly on the MSE. Again, we compare the “optimal” bandwidth to two other choices \(h = 1/C \times h^\star\) and \(h = C \times h^\star\), for a given \(C>1\).
Execute the following code, then explain what the plot represents. You may also change C.
par(lwd = 2)
C <- 3
plot(X, MSE(best/C, n, sigma2, f, 50), t = 'l', ylab = "MSE", col = 2, ylim = c(0, 0.1))
lines(X, MSE(best, n, sigma2, f, 50), col = 3)
lines(X, MSE(C*best, n, sigma2, f, 50), col = 4)
lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
legend("top", lgd, col = 2:4, lty = 1)
abline(h=0, lty = 2)
This plot represents the MSE as a function of X: for each \(X_i\), nRep estimates \(\hat{f}(X_i)\) are computed, and the MSE at \(X_i\) is the average of the squared errors over these nRep replications. The plot shows that the MSE is very low between 0.2 and 0.8 whatever the value of h, but it increases near the boundaries of the interval.
Let’s try it with other values for C:
par(lwd = 2, mfrow = c(1,1))
C_list <- c(1,2,4, 8)
for (C in C_list) {
plot(X, MSE(best/C, n, sigma2, f, 50), t = 'l', ylab = "MSE", col = 2, ylim = c(0, 0.1), main = paste("C = ", C))
lines(X, MSE(best, n, sigma2, f, 50), col = 3)
lines(X, MSE(C*best, n, sigma2, f, 50), col = 4)
lgd <- c("h = 1/C x h*", "h = h*", "h = C x h*")
legend("top", lgd, col = 2:4, lty = 1)
abline(h=0, lty = 2)
}
The boundary effects appear whatever the value of C. We can even notice that for the largest C, the MSE also increases in the center of the interval. This is because a large C means a very small (\(h^\star/C\)) or very large (\(C \times h^\star\)) bandwidth, which correspond to overfitting and underfitting respectively.
par(lwd = 2, mfrow = c(1,1))
plot(X, MSE(best, n, sigma2, f, 50), col = 1)
lgd <- c("h = h*")
legend("top", lgd, col = 1, lty = 1)
abline(h=0, lty = 2)
We notice the boundary effects commented on in the previous question: there is a systematic bias near the boundaries of the interval [0,1]. Apart from these boundary effects, the MSE is consistently low.
bandwidth <- best
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)
The NW estimator can be seen as a weighted average of the \(Y_i\), where the weights are given by a Gaussian kernel: the further a point \(X_i\) is from \(x\), the lower its weight in the estimation.
For instance, with \(x = 1/2\), a large part of the interval contributes to the estimator. Moreover, this is the central zone where the function is largest, so a large part of the available information is used.
Near the boundaries, however, only a small proportion of the \(X_i\) actually contributes, because half of the Gaussian centered at 0 or 1 falls outside \([0,1]\). Moreover, the function f is very close to 0 for \(x \approx 0\) or \(x \approx 1\), so most of the observations receive a very small weight and contribute little to the estimator.
That is why the estimator performs poorly at the boundaries for this function.
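A small sketch makes this visible: we compare the normalized Gaussian weights used by the NW estimator at the left boundary and at an interior point (the bandwidth 0.057 is the \(h^\star\) used above; h_w is just a local name for it):
h_w <- 0.057
w_left <- dnorm((X - 0) / h_w); w_left <- w_left / sum(w_left)
w_center <- dnorm((X - 0.5) / h_w); w_center <- w_center / sum(w_center)
plot(X, w_left, type = "l", col = 2, ylab = "NW weight")
lines(X, w_center, col = 3)
legend("top", c("x = 0", "x = 0.5"), col = 2:3, lty = 1)
The weights at x = 0 are roughly twice as large as those at x = 0.5, because half of the kernel mass falls outside \([0,1]\): the estimate at the boundary relies on fewer effective observations.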
When the bandwidth increases, the variance of the Gaussian kernel increases, which means that for a given \(x\), more weight is given to points \(X_i\) that are far from \(x\). If the bandwidth increases too much, the weight given to distant points becomes too important and the model underfits.
We expect the boundary effect to be even more pronounced in that case, for the reasons explained in the previous question.
bandwidth <- best * 2
fHatMat <- replicate(101, fHat(bandwidth, n, sigma2, f))
curve(f)
matlines(X, fHatMat, col = 7)
curve(f, add = TRUE)
par(lwd = 2, mfrow = c(1,1))
plot(X, MSE(bandwidth, n, sigma2, f, 50), col = 1)
lgd <- c("h = 2 x h*")
legend("top", lgd, col = 1, lty = 1)
abline(h=0, lty = 2)
As seen on the plot, the boundary effect is indeed stronger for \(h = 2 h^\star\).
We now write a function fHat_LP that generates a local polynomial (LP) regression estimator, together with functions MSE_LP and MISE_LP that estimate the associated MSE and MISE, respectively. We first define a wrapper around locpoly:
my_NW_LP <- function(x, y, bandwidth, degree, gridsize = length(x), ...) {
locpoly(x, y, degree = degree, bandwidth = bandwidth, gridsize = gridsize, ...)
}
fHat_LP <- function(bandwidth, n, sigma2, f, degree) {
eps <- rnorm(n, sd = sqrt(sigma2))
X <- 1:n/n
Y <- f(X) + eps
fit <- my_NW_LP(X, Y, bandwidth = bandwidth, degree = degree)
fit$y
}
MISE_LP <- function(bandwidth, x, y, f, degree) {
fit <- my_NW_LP(x, y, bandwidth = bandwidth, degree)
mean((fit$y - f(fit$x))^2)
}
MSE_LP <- function(bandwidth, n, sigma2, f, nRep, degree) {
  fHatMat <- replicate(nRep, fHat_LP(bandwidth, n, sigma2, f, degree))
  X <- 1:n/n                        # same fixed design as in fHat_LP
  diffs <- sweep(fHatMat, 1, f(X))
  rowMeans(diffs^2)                 # pointwise MSE at each design point
}
degree_list <- 0:2
optim_bandwidth_list <- c()
print(paste("REMINDER: the optimal bandwitdh estimated before was ", best))
## [1] "REMINDER: the optimal bandwitdh estimated before was 0.049"
for (degree in degree_list) {
opt <- optimize(MISE_LP, X, Y, f, degree, interval = c(0, 1))
best_bwd <- round(opt$minimum, 3)
print( paste ("The optimal bandwidth for LP estimators of degree ", degree, " = ",best_bwd) )
optim_bandwidth_list[ length(optim_bandwidth_list) + 1 ] <- best_bwd
}
## [1] "The optimal bandwidth for LP estimators of degree 0 = 0.049"
## [1] "The optimal bandwidth for LP estimators of degree 1 = 0.072"
## [1] "The optimal bandwidth for LP estimators of degree 2 = 0.171"
par(mfrow = c(1,1))
plot(X, MSE(best, n, sigma2, f, 50), col = 1, type = "l", ylab = "MSE", ylim = c(0, 0.2), lwd = 2)
abline(h = 0, lty = 2)
for (i in 1:length(degree_list)) {
  lines(X, MSE_LP(optim_bandwidth_list[i], n, sigma2, f, 50, degree_list[i]), col = 1 + i, lwd = 2)
}
legend(x = "topleft", legend = c("Previous MSE", "LP deg 0", "LP deg 1", "LP deg 2"), col = 1:4, lty = 1)
Even though they still exist, the boundary effects decrease significantly with a local polynomial of degree 1 or 2.
Recall that a linear nonparametric estimator is an estimator that may be written as
\[ \hat{f}_{n,h}(x) = \sum_{i=1}^{n} Y_i W_{ni}(x, h) \]
(It is linear in \(Y\).) A nice property of linear NP estimators is that the leave-one-out cross-validation risk may be explicitly written as
\[CV(h) = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat{f}(X_i)}{1-W_{ni}(X_i, h)} \right)^2\]
Local polynomial estimators are instances of linear NP estimators.
In the case of local polynomials, the matrix of weights \(W=(W_{ni}(X_j))_{i,j}\) may be obtained using a simple R function:
getW <- function(X, h) {
n <- length(X)
W <- matrix(NA, n, n)
for (ii in 1:n){
Yi <- rep(0, n)
Yi[ii] <- 1
fit <- locpoly(X, Yi, bandwidth = h, gridsize = n)
W[ii, ] <- fit$y
}
W
}
In this code, W is filled row by row. For row ii, the weights are computed by applying locpoly to an indicator vector Yi such that Yi[ii] = 1 and Yi[i] = 0 for \(i \neq ii\). Row ii is thus filled with terms that depend on \(X_j - X_{ii}\), where j is the column index, and each entry W[ii, j] is the weight of observation ii in the estimate at \(X_j\).
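As a sanity check that these weights indeed represent the (linear) estimator, we can reconstruct the fitted values from W and compare them with a direct call to locpoly; by linearity in Y, the two should agree up to numerical error:
W_check <- getW(X, 0.057)
fit_check <- locpoly(X, Y, bandwidth = 0.057, gridsize = length(X))
max(abs(as.vector(Y %*% W_check) - fit_check$y))  # expected to be ~0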
CV_risk <- function(X, Y, h) {
  n <- length(X)
  W <- getW(X, h)
  CV <- 0
  for (i in 1:n) {
    fHat_Xi <- sum(Y * W[, i])   # fitted value at X_i: sum_j Y_j W[j, i]
    numerator <- Y[i] - fHat_Xi
    denominator <- 1 - W[i, i]   # leave-one-out correction factor
    CV <- CV + (numerator/denominator)^2
  }
  return(CV/n)
}
CV_risk(X, Y, best)
## [1] 0.04321478
par(mfrow = c(1,1))
CV_risk_list <- c()
for ( h in hs ) {
CV_risk_list [length(CV_risk_list) + 1] <- CV_risk(X, Y, h)
}
plot(hs, CV_risk_list)
CV_risk_optimal_bwd <- hs[which.min(CV_risk_list)]
associated_mse_CV <- sum(MSE(CV_risk_optimal_bwd, n, sigma2, f, 50))/n
associated_mse_opt_bwd <- sum(MSE(best, n, sigma2, f, 50))/n
print(paste ("The bandwidth that minimizes the CV risk is ", CV_risk_optimal_bwd, "with MSE = ", associated_mse_CV ) )
## [1] "The bandwidth that minimizes the CV risk is 0.056 with MSE = 0.00171053302694687"
print( paste("REMINDER : The bandwidth that minimizes the MSE is ", best, "with MSE = ", associated_mse_opt_bwd) )
## [1] "REMINDER : The bandwidth that minimizes the MSE is 0.049 with MSE = 0.00167178533382924"
The bandwidth obtained by cross-validation should be preferred, as it is computed by a method that does not require f to be known.
The function locpoly can also calculate derivatives of the regression:
fit <- locpoly(X, Y, drv = 1, bandwidth = 0.1, gridsize=length(X))
plot(fit$x, fit$y)
ff <- function(x) sin(pi*x)/sqrt(x)/2 + pi*sqrt(x)*cos(pi*x)
curve(ff, add = TRUE)
One can check that ff is the analytic derivative of the function f; the solid curve is thus the true derivative of f.
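Indeed, by the product rule:
\[ f'(x) = \frac{d}{dx}\left[\sqrt{x}\,\sin(\pi x)\right] = \frac{\sin(\pi x)}{2\sqrt{x}} + \pi\sqrt{x}\,\cos(\pi x), \]
which is exactly the function ff defined above.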
The argument drv = 1 means that locpoly estimates the first derivative of the regression function from the observations X and Y.
The black points are the estimated derivative of the function.
Let us simulate noisy observations of ff on the same design X:
YY <- ff(X) + eps
plot(X, YY, main = "Realization of ff")
curve(ff, add = TRUE)
Then let us evaluate the MSE and MISE for various degrees of local polynomials:
degree_list <- 0:2
df <- data.frame( "degree" = degree_list, "MSE" = rep(0, length(degree_list)), "MISE" = rep(0, length(degree_list)))
for (degree in df$degree) {
mse_der <- sum( MSE_LP(best, n, sigma2, ff, 50, degree)) / n
mise_der <- MISE_LP(best, X, YY, ff, degree)
df[df$degree == degree, c("MSE", "MISE")] <- c(mse_der, mise_der)
}
df
## degree MSE MISE
## 1 0 0.007505230 0.007943334
## 2 1 0.002741851 0.003609194
## 3 2 0.002149642 0.003149462
We can also plot the MSE as a function of X:
par(mfrow = c(1,1))
plot(X, MSE(best, n, sigma2, ff, 50), col = 1, type = "l", ylab = "MSE", ylim = c(0, 0.2), lwd = 2)
abline(h = 0, lty = 2)
for (i in 1:length(degree_list)) {
  lines(X, MSE_LP(optim_bandwidth_list[i], n, sigma2, ff, 50, degree_list[i]), col = 1 + i, lwd = 2)
}
legend(x = "topleft", legend = c("Previous MSE", "LP deg 0", "LP deg 1", "LP deg 2"), col = 1:4, lty = 1)
By default, the locpoly function returns a vector of values of \(\hat{f}(x)\) at 401 equally-spaced values of \(x\) in \([0,1]\). Here, we want \(\hat{f}(X_i)\) for \(1 \leq i \leq n\), where the \(X_i\) are n equally-spaced values in \([0,1]\):
fit <- locpoly(X, Y, bandwidth = 1)
length(fit$y)
## [1] 401
length(fit$y) == length(X)
## [1] FALSE
Our custom function locpoly2 forces the gridsize option of locpoly to match the size of the input design:
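The function locpoly2 is not defined in this document; a minimal version consistent with its description and usage would be:
locpoly2 <- function(x, y, ...) {
  # same as locpoly, but with the grid size tied to the input design
  locpoly(x, y, gridsize = length(x), ...)
}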
fit <- locpoly2(X, Y, bandwidth = 1)
length(fit$y)
## [1] 200
length(fit$y) == length(X)
## [1] TRUE
max(abs(X - fit$x))
## [1] 1.110223e-16